%load_ext lab_black
import warnings
warnings.filterwarnings("ignore")
A decision tree is a simple, easy-to-interpret modeling technique for both regression and classification problems. Decision trees usually do not perform very well compared to other methods. Their relevance lies in the fact that they are the building blocks of two of the most successful ML algorithms to date: random forests and gradient boosted trees. In this chapter, we will introduce these tree-based methods.
In our journey to estimate the model $f$ by $\hat f$, we have mainly considered linear functions $f$ so far. We now move to a different function class: decision trees. They were introduced in 1984 by Leo Breiman, Jerome Friedman and others [1] and are sometimes called "Classification and Regression Trees" (CART).
(Binary) decision trees are calculated recursively by partitioning the data into two pieces. Partitions are chosen to optimize the given average loss by asking the best "yes/no" question about the covariates, e.g., "is carat < 1?" or "is color better than F?".
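To make the split search concrete, here is a minimal sketch (the function name and the toy data are ours, not part of the chapter) of an exhaustive threshold search on a single numeric covariate under squared error loss:

```python
import numpy as np

def best_split(x, y):
    """Scan all possible thresholds on one covariate and return the one
    minimizing the total squared error around the two resulting means."""
    best_t, best_sse = None, np.inf
    for t in np.unique(x)[:-1]:  # exclude the maximum (empty right side)
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t, best_sse

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1])
t, _ = best_split(x, y)
print(t)  # 3.0 - the threshold separating the two response clusters
```

A real implementation repeats this scan over all covariates at every node and recurses on both halves.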
For regression problems, the most frequently used loss function is the squared error. For classification, it is "information" (= cross-entropy = log loss = half the unit logistic deviance) or the very similar Gini impurity.
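As a quick illustration of these classification criteria, both impurity measures can be computed from class proportions in a few lines (a sketch; the function names are ours):

```python
import numpy as np

def entropy(p):
    """Cross-entropy ("information") of class proportions p, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity of class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p**2)

# A pure node has impurity 0; a 50/50 node is maximally impure
print(entropy([0.5, 0.5]))  # 1.0
print(gini([0.5, 0.5]))     # 0.5
print(entropy([1.0]))       # 0.0
```

Both measures are minimized by pure nodes, which is why they lead to very similar trees in practice.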
Predictions are calculated by sending an observation through the tree, starting with the question at the "trunk" and ending in a "leaf". The prediction is the value associated with the leaf. For regression situations, such leaf value typically equals the average response of all observations in the leaf. In classification settings, it may be the most frequent class in the leaf or all class probabilities.
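A minimal sketch of this traversal, using a hypothetical depth-one tree on agecat (the split and leaf values are made up for illustration):

```python
# Hypothetical tiny tree: one split, two leaves (numbers invented)
tree_ = {
    "question": ("agecat", 4.5),  # "is agecat <= 4.5?"
    "yes": {"leaf": 0.072},       # claim probability in the left leaf
    "no": {"leaf": 0.056},        # claim probability in the right leaf
}

def predict_one(node, x):
    """Send one observation x (a dict) through the tree to its leaf."""
    while "leaf" not in node:
        feature, threshold = node["question"]
        node = node["yes"] if x[feature] <= threshold else node["no"]
    return node["leaf"]

print(predict_one(tree_, {"agecat": 2}))  # 0.072
```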
The concept of a decision tree is best understood with an example.
We will use the dataCar data set to predict the claim probability with a decision tree. As features, we use veh_value, veh_body, veh_age, gender, area and agecat.
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
# Load data
car = pd.read_csv("car.csv") # see readme how to get the data
# Preprocessing pipe. For simplicity, we treat 'veh_body' as ordinal
num_vars = ["veh_value", "veh_age", "agecat"]
ord_vars = ["gender", "area", "veh_body"]
all_vars = ord_vars + num_vars # Use same order in preprocessor(!)
ord_levels = [sorted(car[v].unique()) for v in ord_vars]
preproc = ColumnTransformer(
transformers=[
("ordinal", OrdinalEncoder(categories=ord_levels), ord_vars),
("numeric", "passthrough", num_vars),
]
)
X = preproc.fit_transform(car)
# Feature matrix
pd.DataFrame(X, columns=all_vars).head()
| | gender | area | veh_body | veh_value | veh_age | agecat |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 3.0 | 1.06 | 3.0 | 2.0 |
| 1 | 0.0 | 0.0 | 3.0 | 1.03 | 2.0 | 4.0 |
| 2 | 0.0 | 4.0 | 12.0 | 3.26 | 2.0 | 2.0 |
| 3 | 0.0 | 3.0 | 10.0 | 4.14 | 2.0 | 2.0 |
| 4 | 0.0 | 2.0 | 3.0 | 0.72 | 4.0 | 2.0 |
# Fit model
model = DecisionTreeClassifier(criterion="entropy", max_depth=3)
model.fit(X, car["clm"])
print("First five predictions (probabilities for no claim/claim):")
print(model.predict_proba(X[0:5]))
First five predictions (probabilities for no claim/claim):
[[0.9333264  0.0666736 ]
 [0.9333264  0.0666736 ]
 [0.92529273 0.07470727]
 [0.92529273 0.07470727]
 [0.9333264  0.0666736 ]]
# Visualize and "read" tree
import numpy as np
import matplotlib.pyplot as plt
print("Fitted decision tree: if condition is 'yes', go to left!")
plt.figure(figsize=(16, 10))
tree.plot_tree(
model, feature_names=all_vars, proportion=True, filled=True, impurity=False
)
plt.show()
print("\nFirst observation (original and encoded)")
(
    pd.concat(  # DataFrame.append was removed from pandas; use pd.concat
        [car[all_vars].head(1), pd.DataFrame([X[0]], columns=all_vars)],
        ignore_index=True,
    ).set_index(np.array(["Original", "Encoded"]))
)
Fitted decision tree: if condition is 'yes', go to left!
First observation (original and encoded)
| | gender | area | veh_body | veh_value | veh_age | agecat |
|---|---|---|---|---|---|---|
| Original | F | C | HBACK | 1.06 | 3.0 | 2.0 |
| Encoded | 0.0 | 2.0 | 3.0 | 1.06 | 3.0 | 2.0 |
Comments
Why was the first split (agecat <= 4.5) chosen? The algorithm scans all covariates for all possible split positions and picks the one with the best average loss improvement. In this case, splitting on the covariate agecat at the value 4.5 reduced the average loss the most.
Properties of decision trees
These properties typically translate 1:1 to combinations of trees like random forests or boosted trees.
In 2001, Leo Breiman introduced a very powerful tree-based algorithm called the random forest, see [2]. A random forest consists of many decision trees. To ensure that the trees differ, two sources of randomness are injected:
- Each tree is fit on a bootstrap sample of the rows (bagging).
- At each split, only a random subset of the covariates is considered as split candidates.
Predictions are found by pooling the predictions of all trees, e.g., by averaging or majority voting.
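The pooling step can be sketched by hand; for simplicity, only the bootstrap (row) randomness is shown here, not the random feature subsets at each split:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = X[:, 0] + rng.normal(scale=0.1, size=200)  # true signal: y = x

# A hand-rolled "forest": each tree sees a bootstrap sample of the rows
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    trees.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))

# Pooled prediction = average over all trees
X_new = np.array([[0.5]])
pred = np.mean([t.predict(X_new) for t in trees])
print(f"{pred:.2f}")  # close to the true value 0.5
```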
Comments about random forests
Let us now fit a random forest for diamond prices with typical parameters and 500 trees. 80% of the data is used for training; the remaining 20% is used to evaluate performance. We use stratified splitting on the response variable. (Throughout the rest of the lecture, we will ignore the problematic aspect of having repeated rows for some diamonds.)
from plotnine.data import diamonds
from sklearn.preprocessing import OrdinalEncoder, KBinsDiscretizer
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Stratified train/test split
binner = KBinsDiscretizer(n_bins=10, encode="ordinal")
y_binned = binner.fit_transform(diamonds[["price"]])
df_train, df_test, y_train, y_test = train_test_split(
diamonds, diamonds["price"], test_size=0.2, random_state=341, stratify=y_binned
)
# Pipeline with model
ord_features = ["cut", "color", "clarity"]
ord_levels = [diamonds[x].cat.categories.to_list() for x in ord_features]
model = make_pipeline(
ColumnTransformer(
transformers=[
("ordinal", OrdinalEncoder(categories=ord_levels), ord_features),
("numeric", "passthrough", ["carat"]),
]
),
RandomForestRegressor(
n_estimators=500,
max_features="sqrt",
min_samples_leaf=5,
oob_score=True,
random_state=17,
n_jobs=4,
),
)
model.fit(df_train, y_train)
# OOB performance
print(f"OOB R-squared: {model[-1].oob_score_:.2%}")
OOB R-squared: 98.07%
# Test performance
rmse_test = mean_squared_error(y_test, model.predict(df_test), squared=False)
print(f"Test RMSE: {rmse_test:.3f}")
print(f"Test R-squared: {model.score(df_test, y_test):.2%}")
Test RMSE: 554.735
Test R-squared: 98.05%
Comments
In contrast to a single decision tree or a linear model, a combination of many trees is not easy to interpret. It is good practice for any ML model to study at least variable importance and the strongest effects, not just its performance. A pure prediction machine is hardly of any interest and might even contain mistakes like using covariates derived from the response. Model interpretation helps to fight such problems and thus also to increase trust in a model.
There are different approaches to measure the importance of a covariate. Since there is no general mathematical definition of "importance", the results of different approaches may be inconsistent with each other. For tree-based methods, a usual approach is to measure how often a covariate $X$ was used in a split or how much total loss improvement came from splitting on $X$.
Approaches that work for any supervised model (including neural nets) include permutation importance and SHAP importance.
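Permutation importance, for instance, is easy to implement in a model-agnostic way. The following is a toy sketch under our own naming, not a library API: shuffling a column and measuring the resulting loss increase:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    """Importance of column j = average increase in the loss `metric`
    when column j is randomly shuffled. Works for any fitted model."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    imps = []
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])  # destroy the information in column j
            scores.append(metric(y, model.predict(X_perm)))
        imps.append(np.mean(scores) - baseline)
    return np.array(imps)

# Toy check: the first feature has three times the coefficient of the second
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] + X[:, 1]
lm = LinearRegression().fit(X, y)
mse = lambda a, b: np.mean((a - b) ** 2)
imp = permutation_importance(lm, X, y, mse)
print(imp.round(2))  # first feature clearly more important
```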
One of the main reasons for the success of modern methods like random forests is that they automatically learn interactions between two or more covariates. Thus, the effect of a covariate $X$ typically depends on the values of the other covariates; in the extreme case, the effect of $X$ is different for each observation. The best we can do is to study the average effect of $X$ over many observations, i.e., to average the effects over the interactions. This leads us to partial dependence plots. They work for any supervised ML model and are constructed as follows: a couple of observations are selected; then, their average prediction is visualized against $X$ while sliding their value of $X$ over a reasonable grid of values, keeping all other variables fixed. The more natural this Ceteris Paribus clause, the more reliable the partial dependence plot.
Remark: A partial dependence plot of a covariate in a linear regression is simply a visualization of its coefficient.
Alternatives to partial dependence plots include accumulated local effect plots and SHAP dependence plots. Both relax the Ceteris Paribus clause.
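To demystify the construction, a partial dependence curve can be computed by hand for any fitted model; for a linear model, the curve is a straight line with slope equal to the coefficient, as noted in the remark above. A minimal sketch (function name ours, toy data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def partial_dependence(model, X, j, grid):
    """For each grid value v: set column j of a copy of X to v,
    then average the model's predictions (the Ceteris Paribus step)."""
    avg = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, j] = v  # slide covariate j over the grid
        avg.append(model.predict(X_mod).mean())
    return np.array(avg)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + X[:, 1]  # true effect of column 0 is 3
lm = LinearRegression().fit(X, y)

pd_curve = partial_dependence(lm, X, 0, np.array([0.0, 1.0]))
print(f"{pd_curve[1] - pd_curve[0]:.2f}")  # 3.00: slope equals the coefficient
```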
For our last example, we will now look at variable importance and partial dependence plots.
# Variable importance
import pandas as pd
# Replace later with: feature_names = model[:-1].get_feature_names_out()
feature_names = ord_features + ["carat"] # same(!!) order as in pipeline
imps = pd.Series(model[-1].feature_importances_, index=feature_names)
_ = imps.sort_values().plot(kind="barh", title="Split gain importance")
# Partial dependence plots
# Note: As soon as sklearn.inspection.PartialDependenceDisplay
# allows for passing a categorical grid, it is an option as well.
import dalex as dx
import plotly
plotly.offline.init_notebook_mode() # for saving html with plotly plots
# Define explainer
exp = dx.Explainer(model, data=df_train, verbose=False)
# Plots
pdp_num = exp.model_profile(
    variables=["carat"], label="Partial dependence for 'carat'", verbose=False
)
pdp_num.plot()
pdp_ord = exp.model_profile(
variable_type="categorical",
variable_splits=dict(zip(ord_features, ord_levels)),
    label="Partial dependence for ordinal variables",
verbose=False,
)
pdp_ord.plot(facet_scales="free")
Comments
carat is the most important predictor.

Exercise: Fit a random forest for the claim indicator clm using covariates veh_value, veh_body, veh_age, gender, area, and agecat. Choose a suitable tree depth either by cross-validation or by minimizing OOB error on the training data. Make sure to fit a probability random forest, i.e., predicting probabilities, not classes. Additionally, make sure to work with a relevant loss function (information/cross-entropy or Gini gain). Evaluate the final model on an independent test dataset. Interpret the results by split gain importance and partial dependence plots.

The idea of boosting was introduced by Schapire in 1990 [3] and roughly works as follows: a simple model is fit to the data. Then, another simple model is added, trying to correct the errors of the first model. This process is repeated until some stopping criterion triggers. As simple models or base learners, usually small decision trees are used, an idea pushed by Jerome Friedman in his famous 2001 article on the very general framework of gradient boosting machines [4].
Modern variants of such gradient boosted trees are XGBoost, LightGBM and CatBoost. These are the predominant algorithms in ML competitions on tabular data; check this comparison for differences (screenshot as of Sept. 25, 2020):
Predictions are calculated similarly to random forests, i.e., by combining the predictions of all trees. As loss/objective function, one can choose among many possibilities. Often, using the same loss function as for a corresponding GLM is a good choice.
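The boosting idea for the squared error can be sketched without any boosting library: each small tree is fit to the residuals (the negative gradient of the half squared error) of the current ensemble and added with a learning-rate weight. A toy sketch on simulated data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # start from the global mean
trees = []
for _ in range(100):
    residuals = y - pred  # negative gradient of the half squared error
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * stump.predict(X)  # add the weighted correction
    trees.append(stump)

train_mse = np.mean((y - pred) ** 2)
print(f"Training MSE after 100 rounds: {train_mse:.3f}")
```

Note that, unlike in a random forest, the trees here are not interchangeable: each one depends on all of its predecessors.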
As an initial example of gradient boosting with XGBoost, we fit a model for diamond prices with the squared error as loss function. The number of rounds/trees is chosen by early stopping on the validation data, i.e., boosting continues until the validation (R)MSE stops improving for a couple of rounds. The learning rate (the weight of each tree) is chosen by trial and error in order to end up with a reasonable number of trees, see the next section for more details.
# We can use either the native XGBoost interface or its Scikit-Learn
# interface. We will use the former.
import xgboost as xgb
from plotnine.data import diamonds
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
# Train/valid split
df_train, df_valid, y_train, y_valid = train_test_split(
diamonds, diamonds["price"], test_size=0.2, random_state=937
)
# Preprocessing
ord_features = ["cut", "color", "clarity"]
ord_levels = [diamonds[x].cat.categories.to_list() for x in ord_features]
encoder = make_column_transformer(
(OrdinalEncoder(categories=ord_levels), ord_features), ("passthrough", ["carat"])
)
X_train = encoder.fit_transform(df_train)
X_valid = encoder.transform(df_valid)
# XGBoost data interface
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)
# Compact set of parameters
params = {"learning_rate": 0.02, "objective": "reg:squarederror", "eval_metric": "rmse"}
# Fit model until validation RMSE stops improving over the last 20 rounds
fit = xgb.train(
params=params,
dtrain=dtrain,
num_boost_round=5000, # a large value
evals=[(dtrain, "train"), (dvalid, "valid")], # the last is used for early stopping
verbose_eval=40,
early_stopping_rounds=20,
)
[0]	train-rmse:5481.28027	valid-rmse:5536.88574
[40]	train-rmse:2540.34790	valid-rmse:2578.91577
[80]	train-rmse:1272.18420	valid-rmse:1306.55017
[120]	train-rmse:766.21503	valid-rmse:800.65320
[160]	train-rmse:592.72235	valid-rmse:625.90753
[200]	train-rmse:538.89673	valid-rmse:571.34314
[240]	train-rmse:520.53619	valid-rmse:553.62933
[280]	train-rmse:512.15106	valid-rmse:546.68897
[320]	train-rmse:507.19961	valid-rmse:544.09937
[360]	train-rmse:502.98166	valid-rmse:543.11127
[400]	train-rmse:499.58090	valid-rmse:542.69501
[413]	train-rmse:498.39648	valid-rmse:542.73810
Comments:
Gradient boosted trees offer quite a lot of parameters. Unlike with random forests, they need to be tuned to achieve good results. It would be naive to use an algorithm like XGBoost without parameter tuning. Here is a selection:
Number of boosting rounds: In contrast to random forests, more trees/rounds are not always beneficial because the model begins to overfit after some time. The optimal number of rounds is usually found by early stopping, i.e., one lets the algorithm stop as soon as the (cross-)validation performance stops improving, see the example above.
Learning rate: The learning rate determines the training speed and the impact of each tree on the final model. Typical values are between 0.01 and 0.5. In practical applications, it is set to a value that leads to a reasonable number of trees (100-1000). Usually, halving the learning rate requires roughly twice as many boosting rounds for comparable performance.
Regularization parameters: Depending on the implementation, additional parameters are
Reasonable regularization parameters are chosen by trial and error or systematically by randomized or grid search CV. Usually, it takes a couple of iterations until the ranges of the parameter values have been set appropriately.
Note on tuning: Since the learning rate, the number of boosting rounds, and the regularization parameters are heavily interdependent, one "big" randomized grid search CV over all of them at once is often not ideal. The above suggestion (fix the learning rate, select the number of rounds by early stopping, and run the grid search only over the regularization parameters) is more focused, see also the example below.
We will use XGBoost to fit diamond prices using the squared error as loss function and RMSE as evaluation metric, now using a sophisticated tuning strategy.
Overall, the modelling strategy is as follows:
# This code is pretty long, but may serve as a general template
import json
from pathlib import Path
import pandas as pd
import xgboost as xgb
from plotnine.data import diamonds
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
# General settings
grid_file = Path("gridsearch") / "diamonds_xgb.txt"
objective = "reg:squarederror"
metric = "rmse"
# Train/test split
df_train, df_test, y_train, y_test = train_test_split(
diamonds, diamonds["price"], test_size=0.2, random_state=341
)
# Preprocessing
ord_features = ["cut", "color", "clarity"]
ord_levels = [diamonds[x].cat.categories.to_list() for x in ord_features]
encoder = make_column_transformer(
(OrdinalEncoder(categories=ord_levels), ord_features), ("passthrough", ["carat"])
)
X_train = encoder.fit_transform(df_train)
X_test = encoder.transform(df_test)
# XGBoost data interface (feature_names are added for importance plot)
feature_names = ord_features + ["carat"]
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_names)
# Step 1: Find learning rate with reasonable tree count (0.02 seems ok)
if True:
params = {"learning_rate": 0.02, "objective": objective, "eval_metric": metric}
# Cross-validation
cvm = xgb.cv(
params=params,
dtrain=dtrain,
num_boost_round=5000,
nfold=5,
early_stopping_rounds=20,
callbacks=[xgb.callback.EvaluationMonitor(period=50)],
)
# A LR of 0.02 provides about 370 trees, which is a convenient amount
print("Best boosting round with default params:\n", cvm.tail(1))
[0] train-rmse:5495.84414 test-rmse:5495.78760
[50] train-rmse:2121.54155 test-rmse:2125.91128
[100] train-rmse:960.68479 test-rmse:972.60244
[150] train-rmse:619.88821 test-rmse:641.85356
[200] train-rmse:538.46134 test-rmse:566.77708
[250] train-rmse:516.40282 test-rmse:549.19456
[300] train-rmse:507.12587 test-rmse:544.62821
[350] train-rmse:500.17528 test-rmse:543.55730
[392] train-rmse:495.25938 test-rmse:543.67495
Best boosting round with default params:
train-rmse-mean train-rmse-std test-rmse-mean test-rmse-std
372 497.539026 1.778335 543.511145 8.908143
# Step 2: Iterate randomized SearchCV for regularization parameters
if False:
from tqdm import tqdm
from sklearn.model_selection import ParameterSampler # , ParameterGrid
# Use ParameterGrid(...) if grid is small enough to check all combinations
# Final grid search after some iterations
grid = ParameterSampler(
{
"learning_rate": [0.02],
"objective": [objective],
"eval_metric": [metric],
"max_depth": [4, 5, 6],
"min_child_weight": [1, 10],
"colsample_bytree": [0.8, 1],
"subsample": [0.8, 1],
"reg_lambda": [0, 1, 2],
"reg_alpha": [0, 1, 2],
# "tree_method": ["hist"], # when data is large
"min_split_loss": [0, 0.0001],
},
n_iter=20,
)
# Iterate over grid and save relevant information on disk
search = []
for g in tqdm(grid):
cvm = xgb.cv(
params=g,
dtrain=dtrain,
num_boost_round=5000,
nfold=5,
early_stopping_rounds=20,
)
# Keep number of rounds, cv score, train score, and parameters
search.append((len(cvm), *cvm.iloc[-1, [2, 0]], g))
with open(grid_file, "w") as f:
json.dump(search, f)
# Load grid and check (A) sort order and (B) if grid ranges were set reasonable
with open(grid_file) as f:
search = json.load(f)
search_df = pd.DataFrame.from_records(
search, columns=["num_boost_round", "cv_score", "train_score", "params"]
).sort_values("cv_score")
with pd.option_context("display.max_colwidth", None):
display(search_df.head())
| num_boost_round | cv_score | train_score | params | |
|---|---|---|---|---|
| 1 | 430 | 542.998486 | 504.787591 | {'subsample': 0.8, 'reg_lambda': 2, 'reg_alpha': 2, 'objective': 'reg:squarederror', 'min_split_loss': 0.0001, 'min_child_weight': 10, 'max_depth': 6, 'learning_rate': 0.02, 'eval_metric': 'rmse', 'colsample_bytree': 1} |
| 18 | 382 | 543.095764 | 505.431964 | {'subsample': 0.8, 'reg_lambda': 0, 'reg_alpha': 2, 'objective': 'reg:squarederror', 'min_split_loss': 0.0001, 'min_child_weight': 10, 'max_depth': 6, 'learning_rate': 0.02, 'eval_metric': 'rmse', 'colsample_bytree': 1} |
| 14 | 373 | 543.511145 | 497.539026 | {'subsample': 1, 'reg_lambda': 1, 'reg_alpha': 0, 'objective': 'reg:squarederror', 'min_split_loss': 0.0001, 'min_child_weight': 1, 'max_depth': 6, 'learning_rate': 0.02, 'eval_metric': 'rmse', 'colsample_bytree': 1} |
| 10 | 390 | 544.175842 | 500.208081 | {'subsample': 1, 'reg_lambda': 2, 'reg_alpha': 1, 'objective': 'reg:squarederror', 'min_split_loss': 0, 'min_child_weight': 1, 'max_depth': 6, 'learning_rate': 0.02, 'eval_metric': 'rmse', 'colsample_bytree': 1} |
| 0 | 604 | 544.376392 | 504.577338 | {'subsample': 1, 'reg_lambda': 1, 'reg_alpha': 1, 'objective': 'reg:squarederror', 'min_split_loss': 0.0001, 'min_child_weight': 1, 'max_depth': 5, 'learning_rate': 0.02, 'eval_metric': 'rmse', 'colsample_bytree': 1} |
# Step 3: Fit on best params
best = search_df.iloc[0]
fit = xgb.train(params=best.params, dtrain=dtrain, num_boost_round=best.num_boost_round)
Now, the model is ready to be inspected: we evaluate its test performance, variable importance, and partial dependence.
# Interpret model
import numpy as np
import dalex as dx
import plotly
from sklearn.metrics import r2_score, mean_squared_error as mse
from xgboost import plot_importance
plotly.offline.init_notebook_mode() # for saving html with plotly plots
# Test performance
pred_test = fit.predict(dtest)
print(f"Test RMSE: {mse(y_test, pred_test, squared=False):.3f}")
print(f"Test R-squared: {r2_score(y_test, pred_test):.2%}")
Test RMSE: 535.254
Test R-squared: 98.19%
# Variable importance regarding MSE improvement
plot_importance(fit, importance_type="gain", show_values=False, xlabel="Gain")
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Gain', ylabel='Features'>
# Partial dependence plots on training data
def pred_fun(model, df):
X = encoder.transform(df)
dX = xgb.DMatrix(X, feature_names=feature_names)
return model.predict(dX)
exp = dx.Explainer(fit, df_train, predict_function=pred_fun, verbose=False)
pdp_num = exp.model_profile(
variable_splits={"carat": np.linspace(0, 3, 41)},
    label="Partial dependence for carat",
verbose=False,
)
pdp_num.plot()
pdp_ord = exp.model_profile(
variable_type="categorical",
variable_splits=dict(zip(ord_features, ord_levels)),
    label="Partial dependence for ordinal variables",
verbose=False,
)
pdp_ord.plot(facet_scales="free")
Comment: The resulting model seems comparable to the random forest with slightly better performance.
The next code snippet shows how to use XGBoost's scikit-learn interface.
from xgboost import XGBRegressor
from sklearn.pipeline import make_pipeline
xgb_model = make_pipeline(
encoder,
XGBRegressor(
**best.params, n_estimators=best.num_boost_round, importance_type="gain"
),
)
xgb_model.fit(df_train, y_train)
print(f"Test R-squared: {xgb_model.score(df_test, y_test):.2%}")
Test R-squared: 98.19%
Exercise: How does the partial dependence plot for carat look now?

Exercise: Fit an XGBoost model for the claim indicator clm and covariates veh_value, veh_body, veh_age, gender, area, and agecat. Use a clean cross-validation/test approach. Use log loss as loss function and evaluation metric. Interpret its results.

In this chapter, we have met decision trees, random forests and tree boosting. Single decision trees are very easy to interpret but do not perform too well. Tree ensembles like the random forest or gradient boosted trees, on the other hand, usually perform very well but are tricky to interpret. We have introduced interpretation tools to look into such "black boxes". The main reason why random forests and boosted trees often provide more accurate models than a linear model lies in their ability to automatically learn interactions and other non-linear effects.
[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and Regression Trees", Wadsworth, Belmont, CA, 1984.
[2] L. Breiman, "Random forests". In: Machine Learning, 2001, 45(1).
[3] R. Schapire, "The strength of weak learnability", Machine Learning, Vol 5, Nr. 2, 1990.
[4] J. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine", 2001.